EDA

In this section, we present our primary dataset and the supplementary datasets to give a fuller picture of the data we are working with.

The goal of this section is to explore how we might combine our data with the strategies and techniques identified in our literature review to profile syndemic relationships for type II diabetes.

Packages

Demographic Data

Because our research goal is to evaluate how social and demographic factors interact with diabetes in a syndemic relationship, it is important to understand the demographic breakdown of the group studied by the National Health and Nutrition Examination Survey (NHANES).

The demographic data set for the study includes 15560 observations of 29 variables including information on race, gender, family income, education level, and language spoken. Names and summaries of each of the variables are shown below.

Missing Values

The data set has 682 missing values for the age variable and 2201 missing values for the ratio_family_income_poverty variable.
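Per-column counts like these can be obtained with base R. The sketch below is illustrative only: `demo_toy` and its values are invented stand-ins, not the actual demographics table.

```r
# Toy stand-in for the demographics table; values are invented for illustration.
demo_toy <- data.frame(
  age = c(34, NA, 61, 45, NA),
  ratio_family_income_poverty = c(1.2, 3.4, NA, NA, NA),
  gender = c("Male", "Female", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Count missing values per column, then keep only the columns that have any.
na_counts <- colSums(is.na(demo_toy))
na_counts <- na_counts[na_counts > 0]
print(na_counts)
```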

Distribution of Continuous Variables

Note:

The Department of Health and Human Services (HHS) poverty guidelines were used as the poverty measure to calculate this ratio. So, the ratio was calculated as:

Ratio = (Total Annual Income)/(Poverty Guideline specific to each year)
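As a minimal sketch of the formula above, the ratio is a simple element-wise division. The function name `income_poverty_ratio` and the guideline value of 25000 are hypothetical placeholders, not official HHS figures or part of the original analysis.

```r
# Hypothetical helper: ratio of family income to the poverty guideline.
# The guideline passed in below is a placeholder, not an official HHS value.
income_poverty_ratio <- function(total_annual_income, poverty_guideline) {
  stopifnot(all(poverty_guideline > 0, na.rm = TRUE))
  total_annual_income / poverty_guideline
}

# Missing incomes stay missing in the result.
ratios <- income_poverty_ratio(c(20000, 45000, NA), poverty_guideline = 25000)
print(ratios)
```

A ratio below 1 means the family income falls under the guideline used for that year.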

Distribution of Categorical Variables

demographics <- demographics[!is.na(demographics$education_level), ]

Gender and Education Stratified by Race

To understand how confounding variables may affect our analysis, it is important to compare the distributions of various factors such as gender and education level by other demographic factors such as race.
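One way to make such a comparison is a row-normalized cross-tabulation, so that each race group's distribution sums to 1 regardless of group size. The data frame below is a toy example with invented category labels, not the NHANES data.

```r
# Toy data; the category labels are invented for illustration.
toy <- data.frame(
  race = c("A", "A", "A", "B", "B", "B", "B"),
  education_level = c("HS", "College", "College", "HS", "HS", "HS", "College"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert counts to within-race proportions (rows sum to 1).
counts <- table(toy$race, toy$education_level)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```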

Diabetes Data

The diabetes data set from the National Health and Nutrition Examination Survey contains information on the diagnosis and progression of disease for each participant in the study. It has 28 variables, including when participants were diagnosed, whether they are on insulin, and how frequently they see a doctor. Names and summaries of each of the variables are shown below.

Missing Values

There are missing values in the age_informed, insulin_length, num_dr_visits_past_year, and how_often_glucose_check variables. These values are most likely missing for participants who have not been informed of a diabetes diagnosis, since the questions would not apply to them.

Distribution of Diagnostic Variables

Health and Nutrition Data

The health and nutritional behavior data detail participants' food choices, including breastfeeding and other childhood feeding practices, the frequency of eating meals prepared away from home, the frequency of eating meals from fast food or pizza places, the use of convenience foods, and knowledge of the MyPlate program. Names and summaries of variables are shown below. The data represent 15560 individuals with 46 different variables observed.

Column Names: 

1. respondent_sequence_num
2. ever_breastfed_or_fed_breastmilk
3. age_stopped_breastfeeding_days
4. diet_healthiness
5. community_government_meals_delivered
6. eat_meals_at_community_senior_center
7. attend_kindergarten_thru_high_school
8. school_serves_school_lunches
9. school_serves_complete_breakfast_daily
10. summer_program_meal_free_reduced_price
11. meals_not_home_prepared_count
12. meals_from_fast_food_or_pizza_place_count
13. ready_to_eat_foods_past_30_days
14. frozen_meals_pizza_past_30_days

Data Types & Missing Values

Breastfeeding and Weaning

Table of respondents fed breast milk or breastfed:

  Value Frequency Percentage
1   Yes      2066   78.73476
2    No       558   21.26524
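A table of this shape can be built with a small helper. This is a sketch rather than the code that produced the output above: `freq_table` is a hypothetical name, and the response vector is reconstructed from the reported counts.

```r
# Hypothetical helper: frequency and percentage of each non-missing value.
freq_table <- function(x) {
  counts <- table(x, useNA = "no")
  data.frame(
    Value = names(counts),
    Frequency = as.integer(counts),
    Percentage = 100 * as.numeric(counts) / sum(counts),
    stringsAsFactors = FALSE
  )
}

# Response vector rebuilt from the counts reported above (2066 Yes, 558 No).
responses <- factor(rep(c("Yes", "No"), times = c(2066, 558)),
                    levels = c("Yes", "No"))
bf <- freq_table(responses)
print(bf)
```

Dividing by the number of non-missing responses, rather than by the full sample, is the choice that reproduces the percentages shown.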

Summary Statistics for age stopped breastfeeding in days:

  mean_age_stopped_breastfeeding median_age_stopped_breastfeeding
1                       198.6769                              121
  sd_age_stopped_breastfeeding min_age_stopped_breastfeeding
1                     218.0595                  5.397605e-79
  max_age_stopped_breastfeeding
1                          1095
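The minimum of 5.397605e-79 is probably not a real measurement. Values of this magnitude commonly appear when SAS transport (XPT) files are read into R, where they stand in for an exact zero. We have not checked this against the codebook for this variable, so the recoding below is an assumption; `fix_xpt_zero` and its tolerance are hypothetical.

```r
# Hypothetical helper: treat vanishingly small magnitudes as exact zeros.
# The 1e-70 tolerance is an arbitrary choice for this sketch.
fix_xpt_zero <- function(x, tol = 1e-70) {
  x[!is.na(x) & abs(x) < tol] <- 0
  x
}

# Toy vector that mixes the suspicious value with ordinary ones.
days <- c(5.397605e-79, 121, 1095, NA)
days_fixed <- fix_xpt_zero(days)
print(days_fixed)
```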

Nutritional Practices

Social Meal Support

Education

Table of respondents who attended kindergarten through high school:

  Value Frequency Percentage
1   Yes      3849   83.63755
2    No       753   16.36245

Laboratory Data

The laboratory data come from 43 XPT files taken from the NHANES website. Because so many files are combined, the cleaned dataset contains 337 columns. Many of these variables are strongly correlated with one another, since some report the same measurement in a different unit. Given the number of files and the number of variables in each, we did not remove these highly correlated columns manually. In addition, after the files were joined on the common Respondent Sequence ID number, every row contains missing values; these gaps are a by-product of the joining process.

The cleaning process removed rows in which every column except the first is NaN, as well as columns that contain only one unique value. Below is a summary of the dataset, along with visualizations of a few of the many variables that we will consider in this project.
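The joining and cleaning steps described here can be sketched in base R as follows. The two small tables and their values are invented, and two details are assumptions of this sketch rather than facts from the original pipeline: rows are filtered before columns, and missing values are ignored when counting unique values.

```r
# Toy stand-ins for two laboratory files; values are invented for illustration.
lab_a <- data.frame(respondent_sequence_num = c(1, 2, 3),
                    albumin = c(4.1, NA, 3.8))
lab_b <- data.frame(respondent_sequence_num = c(2, 3, 4),
                    creatinine = c(1.1, 0.9, NA),
                    flag = c(1, 1, NA))

# Full outer join of all files on the shared respondent ID.
labs <- Reduce(
  function(x, y) merge(x, y, by = "respondent_sequence_num", all = TRUE),
  list(lab_a, lab_b)
)

# Drop rows in which every column other than the ID is missing.
value_cols <- setdiff(names(labs), "respondent_sequence_num")
all_missing <- rowSums(!is.na(labs[value_cols])) == 0
labs <- labs[!all_missing, , drop = FALSE]

# Drop columns with at most one unique non-missing value.
n_unique <- vapply(labs, function(col) length(unique(col[!is.na(col)])),
                   integer(1))
labs <- labs[, n_unique > 1, drop = FALSE]
print(labs)
```

In this toy case, respondent 4 is dropped because no measurement survives the join, and `flag` is dropped because it is constant.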

Albumin in Urine (ug/mL) Testing

Creatinine (mg/dL) Testing

Arsenic Total (ug/L) Testing

Triglyceride (mg/dL) Testing

Total Cholesterol (mg/dL) Testing

Hemoglobin (g/dL) Testing

Questionnaire Data

Alcohol Data

General Alcohol Consumption

The majority of the survey population has had alcohol at least once in their life.

How Much Alcohol Consumed Per Day

Individuals who averaged 1-3 drinks per day over the last year make up around 80% of the data. Individuals who reported having 4 or more drinks per day make up the remaining 20%, with 4-6 drinks accounting for 10% and 7 or more for the other 10%.

Depression Data

For every question in the depression questionnaire, more than half of the respondents answered “not at all”. The “feeling tired or having little energy” and “trouble sleeping or sleeping too much” questions saw higher proportions of “several days” and “more than half the days” responses. The next-highest ratio of “not at all” to other answers was for “poor appetite or overeating”, and the remaining questions are all fairly even.

Health Insurance Data

# A tibble: 4 × 3
  `Covered by Insurance?` Count Proportion
  <chr>                   <int>      <dbl>
1 Yes                     13671   0.879   
2 No                       1852   0.119   
3 Don't know                 29   0.00186 
4 Refused                     8   0.000514

Around 87.9% of the respondents were covered by insurance, and 11.9% were not.

# A tibble: 7 × 3
  `Insurance Type`                          No   Yes
  <chr>                                  <int> <int>
1 covered_by_chip                        15389   171
2 covered_by_medi_gap                    15462    98
3 covered_by_medicaid                    11381  4179
4 covered_by_medicare                    12968  2592
5 covered_by_other_government_insurance  14552  1008
6 covered_by_private_insurance            8457  7103
7 covered_by_state_sponsored_health_plan 14623   937

The most common type of insurance was a private insurance plan, followed by Medicaid, Medicare, and other government insurance.
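This ranking can be checked directly from the counts in the table. The sketch below re-enters those counts by hand; the column names `type`, `no`, `yes`, and `share_yes` are our own labels for illustration.

```r
# Counts copied from the insurance-type table above.
insurance <- data.frame(
  type = c("covered_by_chip", "covered_by_medi_gap", "covered_by_medicaid",
           "covered_by_medicare", "covered_by_other_government_insurance",
           "covered_by_private_insurance",
           "covered_by_state_sponsored_health_plan"),
  no  = c(15389, 15462, 11381, 12968, 14552, 8457, 14623),
  yes = c(171, 98, 4179, 2592, 1008, 7103, 937),
  stringsAsFactors = FALSE
)

# Share of respondents answering "Yes" for each type, sorted from most to least common.
insurance$share_yes <- insurance$yes / (insurance$yes + insurance$no)
ranked <- insurance[order(-insurance$yes), c("type", "yes", "share_yes")]
print(ranked, row.names = FALSE)
```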

Access to Healthcare and Hospital Usage Data

Respondents generally reported being in excellent or good health.

Most respondents also have a consistent place to go to for health care, such as an urgent care or primary care physician.

Occupation Data

Most respondents work between 35 and 40 hours per week.

A majority of the respondents are working at a job or business; the next-largest group is those who are out of work.